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Abstract 

The overlap lattice-Dirac operator contains the sign function e{H). Recent practical 
implementations replace e{H) by a ratio of polynomials, HPn{H'^)/Qn{H'^), and require 
storage of 2?! + 2 large vectors. Here I show that one can use only 4 large vectors at the 
cost of executing the core conjugate algorithm twice. The slow-down might be less than 
by a factor of 2, depending on the architecture of the computer one uses. 
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The overlap [1, 2] has produced a new lattice Dirac operator [3, 4]. Using this operator 
one can define QCD on the lattice with explicit exact global chiral symmetries. Full, exact 
chiral symmetry on the lattice without fine tuning could not be achieved with previous 
lattice Dirac operators. The new operator is given by: 

^ = i±f^, (1) 

where H = 75-D'. There is some freedom in choosing D'. The simplest choice is D' = Dw 
Dw, the Wilson-Dirac operator, in d > 3-dimensions, is taken with hopping parameter k 
close to the middle of the allowed range (2^, 2^34)- Choosing D' = Dw maximizes the 
sparseness of H and thus minimizes the cost of evaluating Hb for generic vectors b. 

Recent implementations of the overlap lattice Dirac operator [5, 6] employ a sum of 
terms of the form jj2\^s with s = 1, .., n and < cr-^ < a^... < u": 

H 1 

e{H)= lim - Yw'—, -. (2) 

8=1 

Using the method of multiple shifts [7] , the number of H'^p operations required to evaluate 
the action of the right hand side of equation (2) on vectors p for arbitrary finite n becomes 
independent of n. Actually, the number of operations is not much more than required to 
compute jja^o-i ^ by the simple conjugate gradient (CG) algorithm [8]. This saving comes 
at some expense: one needs to store 2n + 2 vectors. 

The main objective of this paper is to suggest a way to avoid storing 2n + 2 vectors. 
For large lattices this might be necessary. The saving in storage comes at the price of 
executing two passes over the basic CG algorithm instead of a single pass. Thus, at the 
cost of doubling the amount of H^p operations, storage also becomes almost n independent, 
and does not exceed substantially the requirements of computing -jj^^b by CG. The basic 
trick is similar to the way large storage demands are avoided if one uses the Lanczos method 
[9] to compute not only eigenvalues but also a few eigenvectors. 

Below, I shall first review the single-pass method following the notations and methods 
of Jegerlehner [7]. Next, I construct the two-pass method. I end discussing possible 
advantages of using the two-pass implementation. 

For large enough n, typically, is so small relative to the lowest eigenvalue of i?^ 
that nothing is lost by driving the algorithm with the inversion of H"^ itself, rather than 
H"^ + cr^. In any case, the algorithm is easily altered to make H"^ + the driver. Below, 
I use just H"^. I assume that a vector b is given and my objective is to compute e{H)b. 

Pseudocode for the single-pass version. 
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Input vector: b. 

Vector variables: w, r, p, x^, p^. 

Scalar variables: a, P, p, tolerance, 6norm- 

Initialize: tq = b, po = b, po = {b,b), 6norm = y/po, = 0, (3-i = 1, for s = 1,2, ..,n 
{4 = 0, = b, Ci = 1, Co = !}• 



Iterate: for z = 0, 1, 2, 3, 
{ 



Wi = H'^pi 

Pi = -Pi/{Pi,Wi) 



i. for s = 1, 2, ...n: 
{ 



Ti+i = ri + (3iWi 
Pi+i = {ri+i,ri+i) 
"i+i = Pi+i/Pi 
Pi+i = n+i + 



ii. for s = 1, 2, ...n: 



} 

If y^Pi+i < tolerance • 6norm) exit. 
} 

Let me make a few comments on the algorithm: 

• The denominator in line (A) above will underflow for larger cr* before convergence has 
been reached for the lower cr*. In practice the number of already converged s- values is 
stored and updated, so for any s the CG algorithm is not executed past convergence. 

• Line (B) differs slightly from the explicit pseudo-code in [7]. 

• If one uses + as driver in iJ^'s stead, p can be replaced by p^, resulting in some 
small saving. 
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• The core conjugate gradient constitutes of what is left of the above after the blocks i 
and ii are excluded. 

We are only interested in obtaining a numerical approximation to the vector y = 
e{H)b: The output we need is the linear combination 

n 

Vi+i^^w^xl+i- (3) 

s=l 

In equation (3) I assumed that convergence of the CG procedure is achieved at step i + 1 
for all s. To avoid underflows, one sometimes needs to replace i by ig where ig is the 
s-dependent point of convergence. I sketched above how this is done. 

It is obvious that we are not interested in the individual vectors but only in 

one particular linear combination. The idea is then to calculate only the needed linear 
combination itcrativcly, without extra storage. Just as in the Lanczos case, this requires 
an extra pass over the core CG procedure. 

Each xf_^_i in equation (3) is a linear combination of the s-independent Krylov vectors 
Tj. Thus, yi+i, which is all we care to know, is also a linear combination of r^'s, and 
the single place s-dependence enters is in their coefficients. One cannot compute only the 
linear combination of equation (3) in one pass: The contributions from Krylov vectors 
computed in the early stages of the iteration enter with coefficients that are determined 
only in later iterations. Thus, a first pass is needed for the sole purpose of computing the 
coefficients aj and Pj up to the point where the driving GG process has converged. With 
this information, one can calculate the coefficients and also the points of convergence for 
the different masses (s- values). This information is used to compute all needed coefficients 
of the s-independent vectors rj in the decomposition of yi+i. 

The basic recursion determining the Xj can be read off the algorithm listed above: 

In equation (4) /?| = /3jC|+i/C| '^j+i ~ '^i+i(C|+i/C|)^- Starting from equation (4) I 
derived the following expression: 

yi+i = ^ RkTk 



Rk 



k=0 

,S ^k + l + l^k + l 

e:=i«^^c+i, ' foTk=i+i 



Ya=o Pk+i (ni=i Er=i 



(5) 

for i > k; 
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In equation (5) I adopted the convention that (^11^=1 '^k+jj = 1- Again, the upper bounds 
on the sums over s in equation (5) must be altered to prevent underflows, but the informa- 
tion is available. So, in practice the sums over s only include the s-values that correspond 
to CG processes that have not yet converged at iteration k. 

Equation (5) clarifies why a single pass would not work without the extra storage: 
Rk, for low A;'s, depends on aj, (3j, with j-values up to i. However, the coefficients 
Rk, k = 0, + 1 can be computed after the first pass. To compute y^+i we need the 
vectors and hence a second pass. In the second pass no inner products are required, 
since the coefficients aj and Pj are already known. 

Pseudocode for the two-pass version. 

Input vector: b. 

Vector variables: w, r, p, x. 

Scalar variables and arrays: aj, Pj, (j, p, Rf-, tolerance, 6norm- 

1. First pass. 

Initialize: tq = b, po = b, po = {b, b), 6norm = y/Po, = 0. 

Iterate: for i = 0,1, 2, 3, ...: 

{ 

(3i = -pi/{Pi,Wi) 

Tj+i = + (3iWi 

pi+1 = {ri+i,ri+i) 
tti+i = Pi+i/Pi 
Pi+i = n+i + ai+ipi 

If y^Pi+i < tolerance 6norm, exit. 
} 

2. Compute C| from line (A) in the single pass version, and Rk from equation (5). 

3. Second pass. 

Initialize: tq = b, po = b, xq = Rob. 
Iterate: for A; = 0, 1, 2, 3, ...i: 
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{ 



H^Pk 

Tk + PkWk 



Pk+1 



rk+1 + ak+iPk 



Xk+l 



Xk + Rk+l^k+l 



} 



For large enough lattices one might expect the second version to take twice as long as 
the first. However, the nonuniform architecture of the memory is crucial and the actual 
efficiency attainable in practice can vary. I used a high level Fortran 90 code modeled 
on a package described in [10] and ran it at 64 bit precision on an SGI O2000 with four 
processors, each with 4MB cache memory. All parallelizations were done using automatic 
options. I tested the methods on a three dimensional system with gauge group SU{2), two 
flavors [5], and lattice size 8"^. I used gauge configurations generated from the single pla- 
quette gauge action with coupling /3 = 3.5 and evaluated several of the lowest eigenvalues 
of DD^ . n was set to 32. This computation is of the same order of magnitude as a few 
D-inversions. I found that the second method actually ran faster by 30% than the first, at 
the same accuracy. The observed speed-up instead of the expected slow-down most likely 
reflects cache usage. Carefully optimized codes ought to show at least some slow-down in 
the two-pass method. This slow-down will not exceed a factor of 2. 

The above test was done using the original method of [5], rather than the refined 
version of [6]. In the original method, the quantities Ws and 0"^ are extracted from [11]: 



In the refined version of [6] different weights and pole locations — cr* are used. The 
advantage of the refinement is claimed to be that similar accuracies can be achieved with 
smaller values of n. This does not come entirely for free though. In the original method 
the spectrum of e{H) is approached from within the interval (—1, 1) while in the refined 
method of [6] the approach is oscillatory around ±1. Even tiny excursions of the spectrum 
of 75e(iy) below —1 can cause problems when one inverts D. In applications one probably 
would like to optimize for the accuracy of 1/D rather than the accuracy of e{H). In 
the two-pass method the dependence of the overall computational cost on n has been 
sufficiently weakened to make the original version of the rational approximation possibly 
more attractive. 

In a recent paper [12] another method was presented, also requiring no extra storage. 
This new method employs a Lanczos iteration (similar to CG) to tridiagonalize H. Next, 
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the square root of the relatively small triangular matrix is taken by some spectral method 
(this is similar to old implementations of the overlap [13]). To compute the action of the 
approximate e{H) on a vector b a second pass over the Lanczos procedure is needed. So, 
the ideas of using two passes in order to avoid extra storage are similar. 

It is premature at the moment to decide which method is best; the answer may end up 
dependent on the particular architecture of the computer one uses. Therefore, it is useful to 
keep all possibilities in mind, and avoid getting locked into one particular implementation. 
New tricks producing large savings are still quite possible, given the limited time people 
have been experimenting with implementations of D. In spite of the quite short time that 
has passed since [3], looking back to the first implementations in [13], followed by [14] 
where the Newton iteration was first applied, to more recent work [5, 6, 12], I find the 
amount of progress quite impressive. 
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